1 Project Introduction

As part of the DAND project, the EDA project leads the student to analyse one of the following datasets:

  • White Wine quality dataset
  • Red Wine quality dataset
  • Financial contributions to presidential campaigns by state
  • Loan data from Prosper
  • Student dataset

1.1 Details of this project

1.1.1 First look at the Data

The dataset chosen is the Red Wine dataset. It encompasses 11 input variables and 1 ‘output’ variable. It is one of two datasets related to the red and white variants of the Portuguese “Vinho Verde” wine.

The dataset is composed of 1,599 observations. Each observation has the following variables:

1.1.1.1 Input variables (based on physicochemical tests):

  • fixed acidity (tartaric acid - g / dm^3)
  • volatile acidity (acetic acid - g / dm^3)
  • citric acid (g / dm^3)
  • residual sugar (g / dm^3)
  • chlorides (sodium chloride - g / dm^3)
  • free sulfur dioxide (mg / dm^3)
  • total sulfur dioxide (mg / dm^3)
  • density (g / cm^3)
  • pH
  • sulphates (potassium sulphate - g / dm^3)
  • alcohol (% by volume)

1.1.1.2 Output variable (based on sensory data):

  • quality: score between 0 (very bad) and 10 (very excellent).

1.2 Limitations linked to the dataset

Wine making is a complex process that involves far more than the chemical compounds of the wine itself. Environmental variables linked to the vineyard (year, temperature, location) and variables linked to the winemaker (vinification, age of the wine and so on) could also influence the wine.

Last but not least, the quality is rated by 3 wine experts. Given the wide range of tastes, experience and trends in the wine market, the ratings could change over time.

Nevertheless, the experiment remains significant and will shed light on which of the measured variables influence the decision of the wine experts.

1.3 Goal of the analysis

The data analysis will be ‘guided’ by the following question: Which chemical properties influence the quality of red wines?

To answer this question, different prediction models will be explored to see which fits the data best. As introduced in the dataset explanations, we already know which model fits the data best: “Several data mining methods were applied to model these datasets (i.e. red and white wine dataset) under a regression approach. The support vector machine model achieved the best results.”

At the end of the analysis, the results are expected to show which variables influence the quality of the wine and which do not.

The analysis will be performed only on the provided variables and, if any, engineered variables.

2 Univariate Plot Section

2.1 Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity measurements follow a slightly right skewed normal distribution, with a median of 7.9 g / dm^3 and a mean of 8.32 g / dm^3.
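The skew statements in this section rest on comparing the mean with the median and on the shape of the histogram. As a minimal sketch of that check, the R snippet below uses a synthetic, hypothetical vector `x` (not the wine data) and a small helper `skewness()` that is an assumption of this example rather than part of the original analysis.

```r
# Hypothetical stand-in for a right-skewed measurement (not the wine data).
set.seed(42)
x <- rlnorm(1599, meanlog = log(8), sdlog = 0.5)

# Same six-number summary as the ones printed in this section.
print(summary(x))

# Moment-based sample skewness: a positive value means a longer right tail.
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_x <- skewness(x)

# A mean above the median is a quick hint of right skew.
mean_above_median <- mean(x) > median(x)
```

A positive `skew_x` together with a mean above the median supports the "right skewed" reading; values near zero point to a roughly symmetric distribution.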

2.2 Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity levels follow a slightly right skewed distribution.

2.3 Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

This distribution has many peaks, and it is difficult to determine from the visualization alone what a typical (mean or median) citric acid level would be.

2.4 Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Many bins are empty (this could be solved by adjusting the plot). I decided to leave the plot as is because the measurement method was most likely not precise enough to resolve the data at the bin width of the histogram.

Nevertheless, the distribution looks right skewed, with a median at 2.2 g / dm^3 and a mean at 2.539 g / dm^3.
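The empty bins can be reproduced with a small sketch. It assumes, purely for illustration, that the measurement is recorded to one decimal place; the vector `sugar` is synthetic and the bin widths are hypothetical choices, not the ones used for the original plot.

```r
# Synthetic values rounded to 0.1, mimicking a limited measurement precision.
set.seed(1)
sugar <- round(rlnorm(1599, meanlog = log(2.2), sdlog = 0.35), 1)

# Common upper limit so that both sets of breaks cover all values.
upper <- ceiling(max(sugar)) + 1

# Bins narrower than the recording precision versus bins that match it.
fine   <- hist(sugar, breaks = seq(0, upper, by = 0.05), plot = FALSE)
coarse <- hist(sugar, breaks = seq(0, upper, by = 0.1),  plot = FALSE)

# Count the empty bins in each histogram.
empty_fine   <- sum(fine$counts == 0)
empty_coarse <- sum(coarse$counts == 0)
```

With bins narrower than the recording precision, every other bin in the populated range stays empty; matching the bin width to the precision removes those gaps without changing the data.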

2.5 Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

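The bulk of the chlorides measurements looks roughly normal, but the distribution has a long right tail: the maximum (0.611) is far above the third quartile (0.09) and the mean (0.087) exceeds the median (0.079).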

2.6 Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free sulfur dioxide follows a right skewed distribution, median value measured around 14 and mean around 16.

2.7 Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6.00   22.00   38.00   46.43   62.00  289.00       2

Total sulfur dioxide follows a right skewed distribution with a median measured at 38 mg / dm^3 and a mean around 46 mg / dm^3.

2.8 Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density follows a normal distribution (centred just below 1 g / cm^3, which is the density of water).

2.9 pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH measurements follow a normal distribution, with median and mean around 3.3.

2.10 Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates follow a slightly right skewed distribution, with a median around 0.62 g / dm^3 and a mean around 0.66 g / dm^3.

2.11 Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Slightly right skewed distribution with a mean around 10.4 % alcohol and a median around 10.2 %.

2.12 General ratings

The plot below shows the distribution of Wine ratings

2.13 Multivariate Exploration

2.13.1 Volatile acidity levels exploration

2.13.2 Chlorides levels exploration

2.13.3 Total sulfur dioxide levels visualization

2.13.4 Total sulfur dioxide levels visualization

2.13.5 Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

2.13.6 Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol and density are highly correlated, which is to be expected as ethanol is lighter than water.

2.13.7 What was the strongest relationship you found?

One of the strong relationships found is between fixed acidity, citric acid, density and pH. This relationship seems logical, as these variables measure closely related characteristics.

2.13.7.1 alcohol and chlorides

3 Multivariate Plots Section: Significant variables

3.1 Visualization of the significant variables

To help identify the significant variables, we plot the correlation matrix using the corrplot function:

The result shows an issue in the data with total sulfur dioxide.

## 
## FALSE  TRUE 
##  1597     2

The table of NAs shows only 2 missing values. With so few missing, an appropriate cleaning strategy is either to remove those observations or to populate them with the mean of the sample.

The latter strategy will be used.
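As a minimal sketch of this cleaning step, the snippet below counts the missing values, shows why a single NA is enough to break a correlation, and then applies mean imputation. The data frame `toy` and its values are hypothetical and only stand in for the real columns.

```r
# Hypothetical toy columns; the values are made up for illustration.
toy <- data.frame(
  total.sulfur.dioxide = c(34, 67, NA, 60, 40, NA, 59),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 13, 9, 21)
)

# Count missing values, in the same FALSE/TRUE form as the table above.
print(table(is.na(toy$total.sulfur.dioxide)))

# With the default use = "everything", one NA makes the correlation NA.
r_default <- cor(toy$total.sulfur.dioxide, toy$free.sulfur.dioxide)

# Mean imputation: replace each NA with the mean of the observed values.
observed_mean <- mean(toy$total.sulfur.dioxide, na.rm = TRUE)
toy$tsd_filled <- ifelse(is.na(toy$total.sulfur.dioxide),
                         observed_mean,
                         toy$total.sulfur.dioxide)

# The correlation can now be computed on all rows.
r_filled <- cor(toy$tsd_filled, toy$free.sulfur.dioxide)
```

Mean imputation keeps the column mean unchanged but slightly reduces its variance. With only 2 of 1,599 values missing, the effect should be small either way. Note that `lm()` drops incomplete rows by default, so the imputed column has to be the one passed to the model for the imputation to take effect.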

3.1.1 Interesting relationship between the variables:

The correlation plot shows two small groups of interacting variables:

  • Fixed acidity, citric acid and pH: all of these variables relate to the acidity of the wine.
  • Density and alcohol: alcohol has a different density than water and thus influences the measured density.

3.1.2 Exploration of non correlated variables:

Some pairs of variables have correlations very close to 0 in the correlation matrix: for example sulphates and residual sugar, free sulfur dioxide and chlorides, and fixed acidity and alcohol.

fixed acidity explanations:

http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity

3.1.3 Density multivariate exploration

3.1.4 pH multivariate exploration

3.1.5 alcohol multivariate exploration

3.1.6 Chlorides multivariate exploration

3.1.7 Predictive Model

## 
## Call:
## lm(formula = quality ~ ., data = DT.redwine)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.415e-12 -5.000e-16  1.090e-15  2.420e-15  5.226e-14 
## 
## Coefficients:
##                        Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)           3.000e+00  1.184e-12  2.533e+12   <2e-16 ***
## fixed.acidity         5.008e-16  1.449e-15  3.460e-01    0.730    
## volatile.acidity     -8.778e-15  6.946e-15 -1.264e+00    0.207    
## citric.acid           1.105e-14  8.191e-15  1.349e+00    0.178    
## residual.sugar        9.324e-16  8.390e-16  1.111e+00    0.267    
## chlorides            -1.032e-14  2.341e-14 -4.410e-01    0.659    
## free.sulfur.dioxide   4.665e-17  1.219e-16  3.830e-01    0.702    
## total.sulfur.dioxide -1.712e-17  4.128e-17 -4.150e-01    0.678    
## density              -1.644e-12  1.210e-12 -1.359e+00    0.174    
## pH                   -8.170e-15  1.072e-14 -7.620e-01    0.446    
## sulphates             4.634e-15  6.504e-15  7.130e-01    0.476    
## alcohol               2.360e-16  1.519e-15  1.550e-01    0.877    
## quality_F4            1.000e+00  1.248e-14  8.016e+13   <2e-16 ***
## quality_F5            2.000e+00  1.167e-14  1.714e+14   <2e-16 ***
## quality_F6            3.000e+00  1.175e-14  2.554e+14   <2e-16 ***
## quality_F7            4.000e+00  1.211e-14  3.303e+14   <2e-16 ***
## quality_F8            5.000e+00  1.466e-14  3.410e+14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.592e-14 on 1580 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.045e+28 on 16 and 1580 DF,  p-value: < 2.2e-16
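The summary above reports an R-squared of 1 with residuals around 1e-14. The quality_F coefficients (1, 2, 3, 4, 5 on top of an intercept of 3) reproduce the rating exactly, which indicates that the response itself entered the formula as a factor and left nothing for the chemical variables to explain. The sketch below reproduces the effect on synthetic data and shows one possible way to exclude the column. The data frame `wine` and its values are hypothetical; only the formula pattern `quality ~ .` and the name `quality_F` are taken from the output.

```r
# Synthetic, hypothetical data: quality depends weakly on alcohol.
set.seed(7)
n <- 200
wine <- data.frame(
  alcohol   = rnorm(n, mean = 10.4, sd = 1),
  sulphates = rnorm(n, mean = 0.66, sd = 0.15)
)
latent <- 5.6 + 0.4 * (wine$alcohol - 10.4) + rnorm(n, sd = 0.7)
wine$quality   <- pmin(8, pmax(3, round(latent)))
wine$quality_F <- factor(wine$quality)  # factor copy of the response

# Leaky fit: "." pulls quality_F in as a predictor, so the fit is perfect.
leaky <- lm(quality ~ ., data = wine)

# Clean fit: the same formula with the factor copy removed.
clean <- lm(quality ~ . - quality_F, data = wine)

# summary() warns about an essentially perfect fit for the leaky model.
r2_leaky <- suppressWarnings(summary(leaky)$r.squared)
r2_clean <- summary(clean)$r.squared
```

In the clean fit the coefficients and p-values of the remaining predictors become interpretable again, at the cost of a much lower R-squared.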

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

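A linear model has been fitted to the data. The significant coefficients are listed below as estimate, standard error, t value and p-value (three stars *** mean high statistical relevance). These figures differ from the summary printed above, where quality_F, a factor copy of quality, is included as a predictor and explains the rating perfectly (R-squared of 1).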

  • volatile.acidity -1.083e+00 1.212e-01 -8.941 < 2e-16 ***
  • chlorides -1.873e+00 4.195e-01 -4.466 8.54e-06 ***
  • free.sulfur.dioxide 4.503e-03 2.191e-03 2.055 0.0400 *
  • total.sulfur.dioxide -3.292e-03 7.312e-04 -4.502 7.22e-06 ***
  • pH -4.195e-01 1.921e-01 -2.184 0.0291 *
  • sulphates 9.142e-01 1.145e-01 7.986 2.65e-15 ***
  • alcohol 2.760e-01 2.650e-02 10.416 < 2e-16 ***

Among these, the three most interesting variables to investigate further are volatile acidity, sulphates and alcohol.

4 Final Plots and Conclusion

The study shows a highly significant relationship between the ratings and the measured levels of volatile acidity, chlorides, total sulfur dioxide, sulphates and alcohol.

The following two plots highlight the significance of these measures for the quality ratings.

4.0.0.1 Note on the Linear Model results:

High significance variables:

  • volatile.acidity
  • chlorides
  • total.sulfur.dioxide
  • sulphates
  • alcohol

Medium significance:

  • free.sulfur.dioxide
  • pH

Non-significant variables:

  • density
  • residual.sugar
  • fixed.acidity
  • citric.acid

This plot is very interesting as it clearly shows a pattern for volatile acidity: around 0.4 g / dm^3, most wines are rated between 7 and 8. It would be interesting to see why low volatile acidity brings better ratings; one hypothesis is that the wine ages well and its acidity decreases over time.

It is important to plot the data to see the relationships between variables. Here, the visualization of total sulfur dioxide against alcohol clearly shows two distinct clusters: wines rated low to average (3 to 5) and wines rated 6 and above.

For the final plot, I chose this violin plot because I think we can learn something very interesting from it just by looking at it.

The visualization shows that wines below 11.5 % alcohol are usually rated lower than those above. Next time you choose a bottle, don’t hesitate to check the alcohol content: while many other factors influence the expert ratings, most of the poorly rated wines (5 and below) are below the 11.5 % threshold, and most of the wines above 12 % are rated 6 and above.
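The threshold reading of the violin plot can be checked numerically. The sketch below compares the share of wines rated 6 and above on each side of 11.5 % alcohol. It runs on synthetic data built to have a positive alcohol effect, so the data frame `wine` and the resulting numbers are hypothetical and say nothing about the real dataset.

```r
# Synthetic, hypothetical data with a positive alcohol effect on quality.
set.seed(3)
n <- 300
wine <- data.frame(alcohol = runif(n, min = 8.4, max = 14.9))
latent <- 5 + 0.5 * (wine$alcohol - 10) + rnorm(n, sd = 0.8)
wine$quality <- pmin(8, pmax(3, round(latent)))

# Split at the 11.5 % threshold read from the plot.
above <- wine$alcohol > 11.5

# Share of wines rated 6 and above on each side of the threshold.
share_good <- tapply(wine$quality >= 6, above, mean)
print(share_good)

# Full cross-table of threshold side against rating.
print(table(above = above, quality = wine$quality))
```

Running the same two lines (`tapply` and `table`) on the real data frame would put numbers on "most of" in the paragraph above.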

5 Reflection

5.1 Struggles and successes in the analysis

Plotting so many different variables and values was a bit tedious: defining harmonized colors, scales and titles for all the plots took a long time. I first submitted this project without working on the scales; after adjusting all of them, the plots made much more sense.

The big success is that someone can really read the data through the graphs, with their scales and colors, and identify by intuition which values contribute to the rating of the wine. This should be taken with caution, as we don’t know when the data was collected and the wine rated. Wine taste evolves with time and follows industry trends. For example, if a certain wine was rated 8 in 2000, exactly the same wine with the same taste may get a different rating in 2017.

The analysis also highlighted for me the importance of plotting the data. The difference between multivariate, bivariate and univariate plots is self-explanatory, and I will definitely follow this route in my next analyses.

One part that I think should be developed further, in the machine learning section, is the linear regression: it takes a more mathematical approach, can really guide the analysis, and highlights the level of statistical relevance of each variable.

5.2 Further work

Sharing this analysis with wine experts or chemists would be of great help to further understand the relationships in the data. Ethanol, for example, is lighter than water and does affect density, but to what extent do the other variables affect density? Some expert knowledge could be used to simplify the models or to provide theoretical values to compare with.

It would have been interesting to plot the sulphur dioxide levels against the age of the wine and the quality rating; unfortunately, the age is missing from the dataset.

According to https://www.wineselectors.com.au/selector-magazine/wine/wine-101/preserving-the-truth-on-sulphates-in-wine, sulphur dioxide levels decrease with age as the compound dissipates over time.